NHGIS Data

import numpy as np  # Numerical helpers
import matplotlib.pyplot as plt  # To visualize
import pandas as pd  # To read data
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm  # To run the OLS regressions
from stargazer.stargazer import Stargazer  # To build the regression comparison table
from IPython.core.display import HTML  # To render the stargazer output
import seaborn as sn

First, we load the required libraries and data. I am using Python for this assignment because I had trouble getting Stata to work reliably.

data1 = pd.read_csv('nhgis/3.csv', encoding = "ISO-8859-1")
data2 = pd.read_csv('nhgis/4.csv', encoding = "ISO-8859-1")

In this next section, I merge the two datasets, create a statistic representing the fraction of the population with a B.A. or higher, and then isolate the columns needed for this assignment in the DataFrame.

data = pd.merge(data1, data2, on='GISJOIN')
data['PctBA'] = ((data['ALWGE022'] + data['ALWGE023'] + data['ALWGE024'] + data['ALWGE025'])/data['ALWGE001'])
data = data[['GISJOIN', 'CBSA_x', 'CBSAA_x', 'ALWGE001', 'ALX5E001', 'AMEME001', 'PctBA']]
data.columns = ['GISJOIN', 'CBSA_Name', 'CBSA_Code', 'Population', 'Income_Per_Capita', 'Gini', 'Percent_with_BA']
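One detail worth keeping in mind about the merge step: `pd.merge` on a shared key (here `'GISJOIN'`) defaults to an inner join, so only rows present in both NHGIS files survive. A toy sketch with made-up keys, not the actual NHGIS files:

```python
import pandas as pd

# Toy illustration: an inner join keeps only keys found in both tables.
left = pd.DataFrame({'GISJOIN': ['G1', 'G2'], 'a': [1, 2]})
right = pd.DataFrame({'GISJOIN': ['G2', 'G3'], 'b': [3, 4]})
merged = pd.merge(left, right, on='GISJOIN')
print(list(merged['GISJOIN']))  # ['G2']
```

So any CBSA missing from either downloaded table drops out of `data` silently.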

Question G:

We decided not to restrict by size, since the data tables we downloaded were already aggregated to CBSAs, and the smallest CBSA in the dataset already had a population of over 6,000, which seemed large enough for the estimates to be accurate.
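Had we wanted a size restriction anyway, a population cutoff is a one-line filter. A sketch with toy numbers (the 5,000 threshold here is arbitrary, not part of the assignment):

```python
import pandas as pd

# Hypothetical filter: keep only CBSAs above an arbitrary population cutoff.
toy = pd.DataFrame({'CBSA_Name': ['A', 'B', 'C'],
                    'Population': [6200, 4100, 150000]})
large = toy[toy['Population'] >= 5000]
print(len(large))  # 2
```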

Question H:

Just from sorting the values, the least educated CBSAs all seem to have much lower populations than the most educated CBSAs. In addition, they appear to have lower average per capita income. The Gini coefficient, however, seems somewhat lower in the less educated CBSAs.

data.sort_values('Percent_with_BA')
     GISJOIN  CBSA_Name                                          CBSA_Code  Population  Income_Per_Capita    Gini  Percent_with_BA
658  G37770   Pearsall, TX Micro Area; San Antonio-New Braun...     37770       12449              19256  0.5087         0.072857
173  G17500   Clewiston, FL Micro Area; Cape Coral-Fort Myer...     17500       25671              19167  0.4681         0.082973
704  G39700   Raymondville, TX Micro Area; Brownsville-Harli...     39700       13296              14888  0.4650         0.088598
 82  G13500   Bennettsville, SC Micro Area                          13500       19108              17456  0.4782         0.088916
828  G44900   Summerville, GA Micro Area; Chattanooga-Clevel...     44900       17107              18715  0.4476         0.089320
...  ...      ...                                                     ...         ...                ...     ...              ...
403  G27060   Ithaca, NY Metro Area                                 27060       59774              33075  0.4960         0.535183
201  G18700   Corvallis, OR Metro Area                              18700       55359              33817  0.4801         0.541484
 31  G11460   Ann Arbor, MI Metro Area                              11460      229268              41399  0.4848         0.559271
105  G14500   Boulder, CO Metro Area                                14500      212755              46826  0.4848         0.620766
497  G31060   Los Alamos, NM Micro Area; Albuquerque-Santa F...     31060       13260              60746  0.3909         0.674057

938 rows × 7 columns

Regression 1

This regression is a simple linear regression of income per capita on total population (city size).

Income_Pop = sm.OLS(data['Income_Per_Capita'], sm.add_constant(data['Population'])).fit()
print(Income_Pop.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:      Income_Per_Capita   R-squared:                       0.097
Model:                            OLS   Adj. R-squared:                  0.096
Method:                 Least Squares   F-statistic:                     100.3
Date:                Mon, 11 Oct 2021   Prob (F-statistic):           1.70e-22
Time:                        14:06:38   Log-Likelihood:                -9467.9
No. Observations:                 938   AIC:                         1.894e+04
Df Residuals:                     936   BIC:                         1.895e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.752e+04    199.705    137.821      0.000    2.71e+04    2.79e+04
Population     0.0026      0.000     10.014      0.000       0.002       0.003
==============================================================================
Omnibus:                      118.872   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              408.315
Skew:                           0.587   Prob(JB):                     2.17e-89
Kurtosis:                       6.012   Cond. No.                     8.16e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.16e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

We observe a statistically significant positive relationship between income per capita and city size.
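To put the slope in concrete terms: taken at face value, the 0.0026 coefficient implies that a CBSA with one million more residents is associated with roughly $2,600 more income per capita. This is a descriptive association, not a causal claim:

```python
# Back-of-envelope reading of the Population slope from Regression 1.
slope = 0.0026            # dollars of per-capita income per additional resident
diff = slope * 1_000_000  # comparing CBSAs that differ by one million people
print(round(diff))  # 2600
```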

Next we run a regression of Gini on Population:

Pop_Gini = sm.OLS(data['Gini'], sm.add_constant(data['Population'])).fit()
print(Pop_Gini.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   Gini   R-squared:                       0.023
Model:                            OLS   Adj. R-squared:                  0.022
Method:                 Least Squares   F-statistic:                     21.79
Date:                Mon, 11 Oct 2021   Prob (F-statistic):           3.49e-06
Time:                        14:06:39   Log-Likelihood:                 1956.8
No. Observations:                 938   AIC:                            -3910.
Df Residuals:                     936   BIC:                            -3900.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.4521      0.001    441.058      0.000       0.450       0.454
Population  6.123e-09   1.31e-09      4.667      0.000    3.55e-09     8.7e-09
==============================================================================
Omnibus:                       53.772   Durbin-Watson:                   1.909
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               71.513
Skew:                           0.510   Prob(JB):                     2.96e-16
Kurtosis:                       3.888   Cond. No.                     8.16e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.16e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

We get a somewhat confusing result: the coefficient is extremely small, yet its p-value is highly significant. This is a matter of units rather than a contradiction. The Gini coefficient lives between 0 and 1 while population is measured in single people, so the per-person slope is necessarily tiny even when the relationship is statistically reliable. If you reverse it and regress population on the Gini coefficient, you get the following:

Pop_Gini_Reversed = sm.OLS(data['Population'], sm.add_constant(data['Gini'])).fit()
print(Pop_Gini_Reversed.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Population   R-squared:                       0.023
Model:                            OLS   Adj. R-squared:                  0.022
Method:                 Least Squares   F-statistic:                     21.79
Date:                Mon, 11 Oct 2021   Prob (F-statistic):           3.49e-06
Time:                        14:06:40   Log-Likelihood:                -14008.
No. Observations:                 938   AIC:                         2.802e+04
Df Residuals:                     936   BIC:                         2.803e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1.461e+06   3.62e+05     -4.038      0.000   -2.17e+06   -7.51e+05
Gini        3.715e+06   7.96e+05      4.667      0.000    2.15e+06    5.28e+06
==============================================================================
Omnibus:                     1493.733   Durbin-Watson:                   1.978
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           684543.407
Skew:                           9.648   Prob(JB):                         0.00
Kurtosis:                     133.930   Cond. No.                         39.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
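An alternative to reversing the regression is to rescale the regressor: measuring population in millions leaves the fit unchanged but makes the Gini slope legible. Here this is just the original 6.123e-09 coefficient multiplied through:

```python
# Rescaling the per-person slope from the Gini-on-Population regression.
slope_per_person = 6.123e-09
slope_per_million = slope_per_person * 1e6  # Gini units per million residents
print(round(slope_per_million, 4))  # 0.0061
```

In other words, a CBSA with one million more residents has a Gini coefficient about 0.006 higher on average.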

Finally, we get to the regression of Gini on income per capita. There is a significant negative relationship between the two variables.

Income_Gini = sm.OLS(data['Gini'], sm.add_constant(data['Income_Per_Capita'])).fit()

print(Income_Gini.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   Gini   R-squared:                       0.022
Model:                            OLS   Adj. R-squared:                  0.021
Method:                 Least Squares   F-statistic:                     20.65
Date:                Mon, 11 Oct 2021   Prob (F-statistic):           6.25e-06
Time:                        14:06:40   Log-Likelihood:                 1956.3
No. Observations:                 938   AIC:                            -3909.
Df Residuals:                     936   BIC:                            -3899.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 0.4738      0.005    103.256      0.000       0.465       0.483
Income_Per_Capita -7.249e-07    1.6e-07     -4.544      0.000   -1.04e-06   -4.12e-07
==============================================================================
Omnibus:                       29.614   Durbin-Watson:                   1.892
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               34.674
Skew:                           0.373   Prob(JB):                     2.96e-08
Kurtosis:                       3.575   Cond. No.                     1.34e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.34e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
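The same units caveat applies here: -7.249e-07 is the change in the Gini coefficient per dollar of per-capita income, so a more readable framing is the change per $10,000:

```python
# Reading the Income_Per_Capita slope from the Gini regression in $10,000 steps.
slope_per_dollar = -7.249e-07
slope_per_10k = slope_per_dollar * 10_000
print(round(slope_per_10k, 4))  # -0.0072
```

That is, each additional $10,000 of per-capita income is associated with a Gini coefficient about 0.007 lower.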
stargazer = Stargazer([Income_Pop, Pop_Gini, Income_Gini])
stargazer.significant_digits(4)

stargazer.custom_columns(['Income on Population', 'Gini on Population', 'Gini on Income'], [1, 1, 1])
HTML(stargazer.render_html())
                      Income on Population       Gini on Population       Gini on Income
                              (1)                       (2)                    (3)
Income_Per_Capita                                                          -0.0000***
                                                                            (0.0000)
Population                 0.0026***                 0.0000***
                           (0.0003)                  (0.0000)
const                  27523.4610***                 0.4521***              0.4738***
                         (199.7046)                  (0.0010)               (0.0046)
Observations                  938                       938                    938
R2                          0.0968                    0.0227                 0.0216
Adjusted R2                 0.0958                    0.0217                 0.0205
Residual Std. Error   5859.5028 (df=936)          0.0301 (df=936)        0.0301 (df=936)
F Statistic        100.2890*** (df=1; 936)    21.7852*** (df=1; 936)  20.6460*** (df=1; 936)
Note: *p<0.1; **p<0.05; ***p<0.01

Above is a table comparing all of the regression results side by side.